{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "m0qQu-luFmB5"
      },
      "source": [
        "# Split one COCO annotation JSON file into training and validation JSON files."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9NGkWKGrF3pc"
      },
      "source": [
        "Given a single COCO annotated JSON file, your goal is to split them into training and validation COCO annotated JSON files.\n",
        "\n",
        " A single JSON file needs to be split into training and validation files. The output files will be further converted to TFRecord files using another notebook.\n",
        "\n",
        "This notebook uses a third party library to accomplish this task. The library can split the JSON files according to the ratio. We kept the validation file to contain 20% of the data.\n",
        "\n",
        "This notebook is an end to end example. When you run the notebook, it will take one JSON file and will split into a train and a val JSON file."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GIjj-vE-n1e3"
      },
      "source": [
        "**Note** - In this example, we assume that all our data is saved on Google drive and we will also write our outputs to Google drive. We also assume that the script will be used as a Google Colab notebook. But this can be changed according to the needs of users. They can modify this in case they are working on their local workstation, remote server or any other database. This colab notebook can be changed to a regular jupyter notebook running on a local machine according to the need of the users."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QElyM7FtWv5E"
      },
      "source": [
        "## **MUST DO** - Install and restart runtime"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "WMy_xu64FJ1j"
      },
      "outputs": [],
      "source": [
        "# install python object detection insights library to merge multiple COCO annotation files\n",
        "!pip install pyodi\n",
        "\n",
        "# RESTART THE RUNTIME in order to use this library"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tySpWIuVFPj0"
      },
      "source": [
        "## Run the below command to connect to your google drive"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RfJAkMY9FSPz"
      },
      "outputs": [],
      "source": [
        "# import other libraries\n",
        "from google.colab import drive\n",
        "import pyodi\n",
        "import subprocess\n",
        "import sys\n",
        "import os\n",
        "import json\n",
        "import numpy as np\n",
        "import pandas as pd"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "AOLmsOOZFVdJ",
        "outputId": "f7f6dba8-0872-4d21-d55d-2b95c42a06a4"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Mounted at /content/gdrive\n",
            "Successful\n"
          ]
        }
      ],
      "source": [
        "# connect to google drive\n",
        "drive.mount('/content/gdrive')\n",
        "\n",
        "# making an alias for the root path\n",
        "try:\n",
        "  !ln -s /content/gdrive/My\\ Drive/ /mydrive\n",
        "  print('Successful')\n",
        "except Exception as e:\n",
        "  print(e)\n",
        "  print('Not successful')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "v0vJRt5qUOD_"
      },
      "source": [
        "## Visualization function"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HbNMcLBmUOZ2"
      },
      "outputs": [],
      "source": [
        "def data_creation(path: str) -\u003e pd.DataFrame:\n",
        "  \"\"\"Create a dataframe with the occurences of images and categories.\n",
        "  Args:\n",
        "    path: path to the annotated JSON file.\n",
        "  Returns:\n",
        "    dataset consisting of the counts of images and categories.\n",
        "  \"\"\"\n",
        "  # get annotation file data into a variable\n",
        "  with open(path) as json_file:\n",
        "    data = json.load(json_file)\n",
        "\n",
        "  # count the occurance of each category and an image in the annotation file\n",
        "  category_names = [i['name'] for i in data['categories']]\n",
        "  category_ids = [i['category_id'] for i in data['annotations']]\n",
        "  image_ids = [i['image_id'] for i in data['annotations']]\n",
        "\n",
        "  # create a dataframe\n",
        "  df = pd.DataFrame(\n",
        "      list(zip(category_ids, image_ids)), columns=['category_ids', 'image_ids'])\n",
        "  df = df.groupby('category_ids').agg(\n",
        "      object_count=('category_ids', 'count'),\n",
        "      image_count=('image_ids', 'nunique'))\n",
        "  df = df.reindex(range(1, len(data['categories']) + 1), fill_value=0)\n",
        "  df.index = category_names\n",
        "  return df\n",
        "\n",
        "def visualize_detailed_counts_horizontally(path: str) -\u003e None:\n",
        "  \"\"\"Plot a vertical bar graph showing the counts of images \u0026 categories.\n",
        "  Args:\n",
        "    path: path to the annotated JSON file.\n",
        "  \"\"\"\n",
        "  df = data_creation(path)\n",
        "  ax = df.plot(\n",
        "      kind='bar',\n",
        "      figsize=(40, 10),\n",
        "      xlabel='Categories',\n",
        "      ylabel='Counts',\n",
        "      width=0.8,\n",
        "      linewidth=1,\n",
        "      edgecolor='white')  # rot = 0 for horizontal labeling\n",
        "  for p in ax.patches:\n",
        "    ax.annotate(\n",
        "        text=np.round(p.get_height()),\n",
        "        xy=(p.get_x() + p.get_width() / 2., p.get_height()),\n",
        "        ha='center',\n",
        "        va='top',\n",
        "        xytext=(4, 14),\n",
        "        textcoords='offset points')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gmTfyRQo9pT3"
      },
      "source": [
        "## Define the paths of inputs and outputs"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nl5MrEPR9q9x"
      },
      "outputs": [],
      "source": [
        "input_file = '/mydrive/TFHub/jsons/merged.json' #@param {type:\"string\"}\n",
        "output_folder = '/mydrive/TFHub/jsons/' #@param {type:\"string\"}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2E7P4_2eFaPB"
      },
      "source": [
        "## Split coco annotation file into train and val COCO files"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "9HLYrO4JGKFm",
        "outputId": "a31b04fa-0d7c-4c22-cd18-58672d5a29e7"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\u001b[32m2022-09-09 21:40:00.173\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[36mpyodi.apps.coco.coco_split\u001b[0m:\u001b[36mrandom_split\u001b[0m:\u001b[36m183\u001b[0m - \u001b[1mGathering images...\u001b[0m\n",
            "\u001b[32m2022-09-09 21:40:00.192\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[36mpyodi.apps.coco.coco_split\u001b[0m:\u001b[36mrandom_split\u001b[0m:\u001b[36m194\u001b[0m - \u001b[1mGathering annotations...\u001b[0m\n",
            "\u001b[32m2022-09-09 21:40:11.078\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[36mpyodi.apps.coco.coco_split\u001b[0m:\u001b[36mrandom_split\u001b[0m:\u001b[36m218\u001b[0m - \u001b[1mSaving splits to file...\u001b[0m\n",
            "/mydrive/TFHub/jsons/_train.json\n",
            "/mydrive/TFHub/jsons/_val.json\n"
          ]
        }
      ],
      "source": [
        "# split a COCO annotation file into train and val files\n",
        "!pyodi coco random-split $input_file $output_folder --val-percentage 0.2\n",
        "\n",
        "# there will be two files with name '_train.json' and '_val.json' in the output_folder"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wLnDJLIuMf8o"
      },
      "source": [
        "## Visualization"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2dNcl3XCMLDX"
      },
      "outputs": [],
      "source": [
        "# visualization of the input COCO annotated JSON file\n",
        "visualize_detailed_counts_horizontally(input_file)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GHZZ3aLbMO35"
      },
      "outputs": [],
      "source": [
        "# visualization of the training COCO annotated JSON file\n",
        "print('Train JSON')\n",
        "visualize_detailed_counts_horizontally(output_folder + '_train.json')\n",
        "\n",
        "print('Validation JSON')\n",
        "# visualization of the validation COCO annotated JSON file\n",
        "visualize_detailed_counts_horizontally(output_folder + '_val.json')"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "gpuClass": "standard",
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
